BAGGING, BOOSTING, and RANDOM FORESTS
1. Bagging and Random Forests
Recall that bagging is a special case of a random forest, the one with m = p (all p predictors are candidates at every split). So here is just one example of a random forest.
> library(randomForest)
> library(ISLR) # assumed source of the Auto data set
> rf = randomForest(mpg ~ .-name, data=Auto) # m is the number of X-variables sampled at each node
> rf # By default, m = floor(p/3) for regression, but we can also choose our own m.
Call:
randomForest(formula = mpg ~ . - name, data = Auto, subset = Z)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2
Mean of squared residuals: 8.130213
% Var explained: 86.53
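# The value 2 in the printout is consistent with the default rule for regression. A minimal sketch of the arithmetic, assuming the defaults are floor(p/3) for regression and floor(sqrt(p)) for classification, as described in the randomForest documentation:

```r
# Default mtry rules as documented for randomForest (assumption: these match
# the installed package version).
p <- 7                                        # predictors left after dropping mpg and name
mtry_regression     <- max(floor(p / 3), 1)   # default for a numeric response
mtry_classification <- floor(sqrt(p))         # default for a factor response
mtry_regression
mtry_classification
```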
> importance(rf) # Total decrease in node impurity (residual sum of squares, for regression) from splits on the given X-variable, averaged over all trees
             IncNodePurity
cylinders        2142.2611
displacement     3135.1382
horsepower       1752.5234
weight           2399.9452
acceleration      524.3148
year             1196.4857
origin            546.0370
> varImpPlot(rf)
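# Since bagging is a random forest with m = p, a bagged model can be sketched by setting mtry to the number of predictors. This is only an illustration: it assumes the Auto data come from the ISLR package, and the seed is arbitrary.

```r
library(randomForest)
library(ISLR)                      # assumed source of the Auto data set

set.seed(1)                        # arbitrary seed, only for reproducibility
p <- ncol(Auto) - 2                # 9 columns minus the response (mpg) and name
bag <- randomForest(mpg ~ . - name, data = Auto, mtry = p)
bag$mtry                           # number of variables tried at each split
```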
2. Cross-validation
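> n = nrow(Auto) # number of observations, 392
> Z = sample(n,200) # indices of a random training set of 200 observations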
> rf = randomForest(mpg ~ .-name, data=Auto, subset=Z)
> Yhat = predict(rf, newdata=Auto[-Z,])
> mean((Yhat - Auto$mpg[-Z])^2)
[1] 10.19022
# The mean squared error of prediction, estimated by the validation-set approach (a single random split into training and test data), is 10.19022.
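# A single random split gives a noisy estimate. A K-fold version of the same calculation can be sketched as follows; K = 5, the seed, and the names fold and sq.err are illustrative choices, and the Auto data are assumed to come from the ISLR package.

```r
library(randomForest)
library(ISLR)                                  # assumed source of the Auto data set

set.seed(1)                                    # arbitrary seed
K      <- 5                                    # number of folds (illustrative choice)
n      <- nrow(Auto)
fold   <- sample(rep(1:K, length.out = n))     # random fold label for each row
sq.err <- numeric(n)                           # squared prediction error for each row
for (k in 1:K) {
  test <- which(fold == k)                     # rows held out in this fold
  fit  <- randomForest(mpg ~ . - name, data = Auto[-test, ])
  Yhat <- predict(fit, newdata = Auto[test, ])
  sq.err[test] <- (Yhat - Auto$mpg[test])^2
}
cv.mse <- mean(sq.err)                         # K-fold estimate of the prediction MSE
cv.mse
```

# Every observation is predicted exactly once by a forest that did not see it, so the estimate uses the whole data set rather than one test set.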
3. Searching for the optimal solution
> dim(Auto)
[1] 392 9
# The data set has 9 variables; excluding the response mpg and name leaves p = 7 predictors. Let's sample m = sqrt(7), about 2.65, rounded to 3.
> rf3 = randomForest(mpg ~ .-name, data=Auto, mtry=3) # mtry is m, the number of X-variables available at each node
> plot(rf3)
# How many trees to grow? The default is 500, but the error is rather flat after about 100 trees.
# The randomForest object has several components:
> names(rf3)
[1] "call" "type" "predicted" "mse"
[5] "rsq" "oob.times" "importance" "importanceSD"
[9] "localImportance" "proximity" "ntree" "mtry"
[13] "forest" "coefs" "y" "test"
[17] "inbag" "terms"
# We would like to minimize the mean squared error and maximize R^2, the percentage of total variation explained by the forest.
> which.min(rf3$mse)
[1] 147
> which.max(rf3$rsq)
[1] 147
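# The two indices coincide necessarily, not by luck: R^2 is a decreasing function of MSE. A small sketch with made-up numbers, assuming R^2 is computed as 1 - MSE/Var(Y) (the package may scale the variance slightly differently, which does not change the argument):

```r
# Toy numbers, not taken from the Auto fit: a hypothetical MSE path over trees.
mse   <- c(12.0, 9.5, 8.1, 7.6, 7.9, 8.0)
var.y <- 60                       # hypothetical variance of the response
rsq   <- 1 - mse / var.y          # assumed relation between R^2 and MSE
which.min(mse)                    # position of the smallest MSE
which.max(rsq)                    # position of the largest R^2, the same index
```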
# All right, let's use 147 trees, whose predictions get averaged in this random forest.
> rf3.147 = randomForest(mpg ~ .-name, data=Auto, mtry=3, ntree=147)
> rf3.147
Call:
randomForest(formula = mpg ~ . - name, data = Auto, mtry = 3, ntree = 147)
Type of random forest: regression
Number of trees: 147
No. of variables tried at each split: 3
Mean of squared residuals: 7.322948
% Var explained: 87.63
# This is an improvement in both MSE and R^2 compared with our first random forest.
# We can optimize both m and the number of trees by cross-validation.
> Z = sample(n,n-50) # I'm choosing a small test set (50 observations) so that the training set is close to the whole data set,
> cv.err = rep(0,7) # because the optimal random forest may depend heavily on the sample size
> n.trees = rep(0,7)
> for (m in 1:7){
+ rf.m = randomForest( mpg ~ .-name, data=Auto[Z,], mtry=m ) # grow 500 trees with m variables per split
+ opt.trees = which.min(rf.m$mse) # number of trees with the smallest out-of-bag MSE
+ rf.m = randomForest( mpg ~ .-name, data=Auto[Z,], mtry=m, ntree=opt.trees ) # refit with that many trees
+ Yhat = predict( rf.m, newdata=Auto[-Z,] )
+ mse = mean( (Yhat - Auto$mpg[-Z])^2 ) # test-set MSE
+ cv.err[m] = mse
+ n.trees[m] = opt.trees
+ }
> which.min(cv.err)
[1] 7
# 7? Apparently, bagging (m = p = 7) was the best choice among random forests on this split, although the errors for m = 3, ..., 7 are close (roughly 8.95 to 9.25).
> plot(cv.err); lines(cv.err)
> cv.err
[1] 10.652190 9.368370 9.023726 9.000002 8.996304 9.248892 8.951198
> n.trees
[1] 112 494 318 208 484 319 293
# Result: here is the optimal random forest, which happened to reduce to bagging.
> rf.optimal = randomForest( mpg ~ .-name, data=Auto, mtry=7, ntree=293 )
> rf.optimal
Type of random forest: regression
Number of trees: 293
No. of variables tried at each split: 7
Mean of squared residuals: 7.407668
% Var explained: 87.81
> importance(rf.optimal)
             IncNodePurity
cylinders        4820.4219
displacement     7672.7643
horsepower       2838.2213
weight           4603.0577
acceleration      647.7559
year             2861.6324
origin            133.9062
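# The choice m = 7 above rests on one 50-observation test set. As a cross-check, the values of m can also be compared on the out-of-bag error stored in the mse component, with no held-out data. This is a sketch, not the procedure used above; it assumes the Auto data come from the ISLR package, and the seed and the name oob.err are illustrative.

```r
library(randomForest)
library(ISLR)                                   # assumed source of the Auto data set

set.seed(1)                                     # arbitrary seed
oob.err <- sapply(1:7, function(m) {
  fit <- randomForest(mpg ~ . - name, data = Auto, mtry = m)
  fit$mse[fit$ntree]                            # out-of-bag MSE after all trees are grown
})
round(oob.err, 2)                               # one out-of-bag MSE per value of m
which.min(oob.err)                              # value of m with the smallest out-of-bag MSE
```

# If the ranking of m changes between seeds or between this comparison and the test-set comparison, the differences among the larger values of m are probably within noise.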